Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment
Chen, Ziyi, Li, Junyi, Yu, Peiran, Huang, Heng
Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) are important techniques for aligning large language models (LLMs) with human preferences. However, the quality of RLHF and DPO training is seriously compromised by \textit{\textbf{C}orrupted} preferences, reward \textit{\textbf{O}veroptimization}, and bias towards \textit{\textbf{V}erbosity}. To our knowledge, most existing works tackle only one of these important issues, and the few works that address more than one require substantial computation to estimate multiple reward models and lack theoretical guarantees on generalization ability. In this work, we propose the RLHF-\textbf{COV} and DPO-\textbf{COV} algorithms, which can simultaneously mitigate all three issues, in both offline and online settings. This ability is theoretically demonstrated by obtaining length-regularized generalization error rates for our DPO-COV algorithms trained on corrupted data, which match the best-known rates for the simpler cases with clean data and without length regularization. Moreover, our DPO-COV algorithm is simple to implement without reward estimation, and is proved to be equivalent to our RLHF-COV algorithm, which directly implies the equivalence between the vanilla RLHF and DPO algorithms. Experiments demonstrate the effectiveness of our DPO-COV algorithms under both offline and online settings.
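To make the length-regularization idea concrete, below is a minimal sketch of a length-regularized DPO-style pairwise loss; the function name, the hyperparameters `beta` and `alpha`, and the exact form of the length term are illustrative assumptions, not the paper's definition of the DPO-COV objective.

```python
import torch.nn.functional as F

def length_regularized_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                                len_w, len_l, beta=0.1, alpha=0.01):
    """Illustrative length-regularized DPO-style loss (a sketch, not the
    paper's exact DPO-COV objective).

    logp_w / logp_l: policy log-probabilities of the chosen / rejected
    responses; ref_logp_*: reference-model log-probabilities of the same
    responses; len_w / len_l: response lengths in tokens. alpha is a
    hypothetical coefficient that penalizes the length gap to discourage
    verbosity.
    """
    # Standard DPO margin: scaled difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Length regularization: subtract a penalty proportional to the length gap.
    margin = margin - alpha * (len_w - len_l)
    return -F.logsigmoid(margin).mean()
```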
Achieve Performatively Optimal Policy for Performative Reinforcement Learning
Performative reinforcement learning is an emerging dynamic decision-making framework that extends reinforcement learning to the common applications where the agent's policy can change the environmental dynamics. Existing works on performative reinforcement learning aim only at a performatively stable (PS) policy that maximizes an approximate value function. However, there is a provably positive constant gap between the PS policy and the desired performatively optimal (PO) policy that maximizes the original value function. In contrast, this work proposes a zeroth-order Frank-Wolfe (0-FW) algorithm, which uses a zeroth-order approximation of the performative policy gradient within the Frank-Wolfe framework, and obtains \textbf{the first polynomial-time convergence to the desired PO} policy under the standard regularizer dominance condition. For the convergence analysis, we prove two important properties of the nonconvex value function. First, when the policy regularizer dominates the environmental shift, the value function satisfies a certain gradient dominance property, so that any stationary point (not PS point) of the value function is a desired PO policy. Second, although the value function has an unbounded gradient, we prove that all sufficiently stationary points lie in a convex and compact policy subspace $\Pi_{\Delta}$, where the policy value has a constant lower bound $\Delta>0$ and thus the gradient becomes bounded and Lipschitz continuous. Experimental results also demonstrate that our 0-FW algorithm is more effective than existing algorithms at finding the desired PO policy.
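As an illustration of the two ingredients named in the abstract, the sketch below combines a zeroth-order gradient estimate with a Frank-Wolfe step on a simplex-constrained policy vector; the flattened per-action policy, smoothing radius `mu`, step size `gamma`, and direction count are simplifying assumptions rather than the paper's exact 0-FW algorithm.

```python
import numpy as np

def zeroth_order_fw_step(policy, value_fn, mu=0.05, gamma=0.1,
                         num_dirs=20, rng=None):
    """One illustrative zeroth-order Frank-Wolfe step (a sketch of the
    0-FW idea, not the paper's exact algorithm).

    policy: probability vector over actions (real policies are per-state);
    value_fn(policy): the performative policy value, accessed only through
    function evaluations, so no analytic gradient of the policy-dependent
    environment is needed.
    """
    rng = rng or np.random.default_rng()
    # Zeroth-order gradient estimate from random finite differences.
    grad_est = np.zeros_like(policy)
    for _ in range(num_dirs):
        u = rng.standard_normal(policy.shape)
        grad_est += (value_fn(policy + mu * u) - value_fn(policy)) / mu * u
    grad_est /= num_dirs
    # Frank-Wolfe linear oracle over the probability simplex: a one-hot vertex.
    s = np.zeros_like(policy)
    s[np.argmax(grad_est)] = 1.0
    # Convex combination keeps the iterate inside the simplex.
    return (1.0 - gamma) * policy + gamma * s
```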
Trade-off in Estimating the Number of Byzantine Clients in Federated Learning
Chen, Ziyi, Zhang, Su, Huang, Heng
Federated learning has attracted increasing attention in recent large-scale optimization and machine learning research and applications, but it is vulnerable to Byzantine clients that can send arbitrary erroneous signals. Robust aggregators are commonly used to resist Byzantine clients. This usually requires estimating the unknown number $f$ of Byzantine clients and accordingly selecting an aggregator with the proper degree of robustness (i.e., the maximum number $\hat{f}$ of Byzantine clients the aggregator can tolerate). Such an estimation has an important effect on performance, which to our knowledge has not been systematically studied. This work fills this gap by theoretically analyzing the worst-case error of aggregators, as well as that of the induced federated learning algorithm, for all cases of $\hat{f}$ and $f$. Specifically, we show that underestimation ($\hat{f} < f$) ...
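To illustrate what a "degree of robustness" means for a concrete aggregator, here is a standard coordinate-wise trimmed mean whose robustness is set by the estimate $\hat{f}$; this is a common robust aggregator used for illustration, not necessarily the one analyzed in the paper.

```python
import numpy as np

def trimmed_mean(updates, f_hat):
    """Coordinate-wise trimmed mean with robustness degree f_hat, the
    assumed number of Byzantine clients (illustrative; the paper's
    analysis covers robust aggregators more generally).

    updates: (n_clients, dim) array of client updates. Per coordinate,
    the f_hat largest and f_hat smallest values are discarded before
    averaging, so up to f_hat Byzantine values cannot dominate the mean.
    """
    n = updates.shape[0]
    assert 2 * f_hat < n, "trimming requires n > 2 * f_hat"
    sorted_updates = np.sort(updates, axis=0)
    return sorted_updates[f_hat:n - f_hat].mean(axis=0)
```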
A Cubic Regularization Approach for Finding Local Minimax Points in Nonconvex Minimax Optimization
Chen, Ziyi, Hu, Zhengyang, Li, Qunwei, Wang, Zhe, Zhou, Yi
Gradient descent-ascent (GDA) is a widely used algorithm for minimax optimization. However, GDA has been proved to converge only to stationary points in nonconvex minimax optimization, which are suboptimal compared with local minimax points. In this work, we develop cubic regularization (CR) type algorithms that globally converge to local minimax points in nonconvex-strongly-concave minimax optimization. We first show that local minimax points are equivalent to second-order stationary points of a certain envelope function. Then, inspired by the classic cubic regularization algorithm, we propose an algorithm named Cubic-LocalMinimax for finding local minimax points, and provide a comprehensive convergence analysis by leveraging its intrinsic potential function. Specifically, we establish the global convergence of Cubic-LocalMinimax to a local minimax point at a sublinear convergence rate and characterize its iteration complexity. We also propose a GDA-based solver for solving the cubic subproblem involved in Cubic-LocalMinimax up to a certain pre-defined accuracy, and analyze the overall gradient and Hessian-vector product computation complexities of such an inexact Cubic-LocalMinimax algorithm. Moreover, we propose a stochastic variant of Cubic-LocalMinimax for large-scale minimax optimization and characterize its sample complexity under stochastic sub-sampling. Experimental results demonstrate faster convergence of our stochastic Cubic-LocalMinimax compared with some existing algorithms.
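For reference, the cubic subproblem mentioned in the abstract has the standard cubic-regularized Newton form $\min_s \; g^\top s + \frac{1}{2} s^\top H s + \frac{M}{6}\|s\|^3$; the sketch below solves it with plain gradient descent as an illustrative stand-in for the paper's GDA-based inexact solver (`M`, `lr`, and `iters` are assumed settings).

```python
import numpy as np

def solve_cubic_subproblem(grad, hess, M=1.0, lr=0.01, iters=200):
    """Approximately solve  min_s  g^T s + 0.5 s^T H s + (M/6) ||s||^3
    by gradient descent (an illustrative stand-in for the paper's
    GDA-based inexact solver).

    The gradient of the cubic term (M/6)||s||^3 is (M/2)||s|| s.
    """
    s = np.zeros_like(grad)
    for _ in range(iters):
        sub_grad = grad + hess @ s + 0.5 * M * np.linalg.norm(s) * s
        s -= lr * sub_grad
    return s
```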
Momentum with Variance Reduction for Nonconvex Composition Optimization
Composition optimization is widely applied in nonconvex machine learning. Various advanced stochastic algorithms that adopt momentum and variance reduction techniques have been developed for composition optimization. However, these algorithms do not fully exploit both techniques to accelerate convergence, and they lack convergence guarantees in nonconvex optimization. This paper complements the existing literature by developing various momentum schemes with SPIDER-based variance reduction for nonconvex composition optimization. In particular, our momentum design requires fewer proximal mapping evaluations per iteration than the existing Katyusha momentum. Furthermore, our algorithm achieves near-optimal sample complexity in both nonconvex finite-sum and online composition optimization, and achieves a linear convergence rate under the gradient dominance condition. Numerical experiments demonstrate that our algorithm converges significantly faster than existing algorithms in nonconvex composition optimization.
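As a concrete picture of SPIDER-based variance reduction in the composition setting $\min_x f(g(x))$, the sketch below tracks SPIDER-style estimates of the inner mapping and its Jacobian and chains them; the interfaces (`g_batch`, `J_batch`, `grad_f`) and the omission of the momentum and proximal steps are simplifying assumptions, not the paper's exact recursion.

```python
import numpy as np

def composition_spider_step(x, x_prev, g_est, J_est, lr,
                            g_batch, J_batch, grad_f):
    """One illustrative SPIDER-style step for min_x f(g(x)) (a sketch
    under assumed interfaces, not the paper's exact algorithm).

    g_batch(x) / J_batch(x): minibatch estimates of the inner mapping
    g(x) and its Jacobian, evaluated on the SAME samples at x and x_prev
    so the correction terms reduce variance; grad_f: gradient of the
    outer function f.
    """
    # SPIDER recursions for the inner mapping and its Jacobian.
    g_est = g_batch(x) - g_batch(x_prev) + g_est
    J_est = J_batch(x) - J_batch(x_prev) + J_est
    # Chain rule with the tracked estimates: grad F(x) ~ J(x)^T grad_f(g(x)).
    v = J_est.T @ grad_f(g_est)
    return x - lr * v, g_est, J_est
```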